Skip to content
This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

Support Unicode international characters #151

Merged
merged 15 commits into from
Feb 15, 2017
Merged

Support Unicode international characters #151

merged 15 commits into from
Feb 15, 2017

Conversation

WesTyler
Copy link
Contributor

@WesTyler WesTyler commented Feb 7, 2017

  • Punycode domains
  • utilize .codePointAt instead of .charCodeAt
  • tighten restricted codes to control characters instead of non-latin ranges

RFC references:

  • Internationalization of email addresses: RFC 6530
    • Both local and domain portions should support all Unicode characters with the exception of C0 and C1 control characters
    • Unicode characters in domains should be downgraded (via punycode) when interacting with ASCII-only systems, as when resolving MX records in the case of isemail

Addresses #17

  - Punycode domains
  - utilize .codePointAt instead of .charCodeAt
  - tighten restricted codes to control characters instead of non-latin ranges
@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 7, 2017

@skeggse this is going to fail linting because of the let Dns issue with rewire.

Please let me know if I missed anything, or if this is not what you had in mind. I'm happy to push updates to this PR.

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 7, 2017

Actually, it occurs to me that I can escape that line from the linter. I'm going to push a commit for that.

Copy link
Owner

@skeggse skeggse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add a few tests that verify the correct byte length checking behavior for long unicode email addresses - iirc the length limits on email addresses are in numbers of octets, as captured here and here.

Can you comment on whether this line is still correct, given the check against 127? Same for this line, this one, and this one.

Please bump the version to at least 2.3.0. We might consider either going to 3.0.0, or adding an allowUnicode flag to selectively enable this (which would be more complicated). It's possible others depend on this module rejecting unicode email address, and it's also possible that they didn't consider it, and will start having issues if they accept a unicode email address, but can't handle it in their delivery mechanism.

EDIT: sorry for making these concerns kinda late in the process - they did not occur to me sooner.

test/tmp.js Outdated
checkDNS: true
}, () => {

expect(internals.punyDomain).to.equal('xn--5nqv22n.xn--lhr59c');
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but depends on the exact behavior of punycode. I'd rather not test whether punycode did the right thing, just whether validate called punycode correctly. Instead, could you put this?

expect(punycode.toUnicode(internals.punyDomain)).to.equal('伊昭傑@郵件.商務');

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 7, 2017

Thanks for the quick review. Addressing these issues now. 👍

@skeggse
Copy link
Owner

skeggse commented Feb 7, 2017

Ah, I also just noticed RFC 6532, which, in Section 3.1 discusses the appropriate normalization form for the email address. It looks like String#normalize implements the appropriate conversion - do you think it would be appropriate to preprocess the email address by calling normalize on it?

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 7, 2017

I missed RFC 6532... Reading through it now.

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 7, 2017

Thanks @robhorrigan :D

@skeggse re: normalization - I don't know that it is necessary, and it may cause some unintended side effects. I'm looking at Section 4 of RFC 6532. Specifically the fact that normalization can cause the length of the address to exceed requirements/overflow buffers in edge cases.

Copy link
Contributor

@Marsup Marsup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented on the how, not the what, as you know more than me after reading that RFC :)

@skeggse you should keep versioning under your control and not part of this PR.

lib/index.js Outdated

// add C0 control characters

for (let i = 0; i < 33; ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this package only supports node 4+, you can probably use lookup.fill to initialize it.

package.json Outdated
@@ -18,10 +18,12 @@
"node": ">=4.0.0"
},
"dependencies": {
"punycode": "^2.1.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep hapi versioning scheme (2.1.x).

@@ -0,0 +1,47 @@
'use strict';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tmp.js ??

Copy link
Contributor Author

@WesTyler WesTyler Feb 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skeggse's preference is to utilize a proper internationalized domain with an active MX record. I have struck out on finding such a domain, so this approach was his backup plan. I'm guessing tmp.js was so that it doesn't become the permanent testing solution by default. :P

lib/index.js Outdated
@@ -2,8 +2,8 @@

// Load modules

const Dns = require('dns');

let Dns = require('dns'); // eslint-disable-line prefer-const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think proxyquire would allow you to keep that const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skeggse - Do you prefer rewire or is it worth it to utilize proxyquire to prevent the ESLint override?

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 7, 2017

@robhorrigan and I have started on addressing the changes you requested @skeggse. Planning on having the additional tests and refactors up tomorrow AM. We're close :)

@skeggse
Copy link
Owner

skeggse commented Feb 7, 2017

@WesTyler after rereading Section 4 of RFC 6532, I'm not sure I agree with your comment. I read that section as saying that normalization should mitigate some of the security issues.

I'm also concerned about introducing this change without also providing a warning to users reminding them to normalize Unicode emails. Thoughts?

Also thanks for the input, @Marsup. I'll update the version separately. @WesTyler is right about tmp.js, though I continue to be open to a better file name.

Per proxyquire, I'm not sure. I prefer it when we don't make changes to source specifically for testing.

@Marsup
Copy link
Contributor

Marsup commented Feb 7, 2017

proxyquire acts on the require cache, it seems better than rewire for your usage.

@skeggse
Copy link
Owner

skeggse commented Feb 7, 2017

Ah, I think I misread the docs for proxyquire when I skimmed them. Yeah, that looks good.

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 8, 2017

@skeggse You're absolutely right about the normalization. I misread that section and missed the key part:

The normalization process described
in Section 3.1 is recommended to minimize these issues.

Here are my action items for this PR:

  • Address conditional statements to ensure that the correct charCode is caught in the correct places.
  • Update test case in temp.js as requested.
  • Pull out rewire and implement proxyquire instead. This will remove the linting exception.
  • Add preprocessing normalization of emails.

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 8, 2017

@skeggse - need to call in the big guns on this.
I am verifying the charCode conditional here. My interpretation based on that range is that it is targeting C1 controls except for 127 (delete) (the C1 range is 127-159).
The test case that covers this block is "\"test\\©\"@iana.org". Was the intentionally chosen based on the RFC, or was it just a character with Unicode decimal value >127?

@skeggse
Copy link
Owner

skeggse commented Feb 8, 2017

This project was originally ported from the is_email php function. I pulled nearly all the test-cases from that project. My guess is that it was only chosen because it was outside the ASCII range.

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 8, 2017

@skeggse Ok, we are at the last change requested - normalization.

It looks like Node.js is already handling UTF-8 encoded unicode characters.
'\u0101' === 'ā' // true
Also, we tried to set up a test to assert that normalization is taking place as expected by calling Isemail.validate('test\u0101@\u0101.com'). The parameter at this point in the code already has the Unicode character in place instead of the UTF-8 string.
I have been unable to replicate an email address with UTF-8 encoding of Unicode characters for which calling email.normalize() had any effect.

Thoughts??

@skeggse
Copy link
Owner

skeggse commented Feb 8, 2017

JavaScript strings are UCS-2 encoded, not UTF-8, so you'll need to try a character that needs to be represented as a surrogate pair in UCS-2.

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 8, 2017

Oh geez. It's been a long day... XD

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 8, 2017

Ok, so, this is as close as @robhorrigan and I have been able to get for writing tests to cover normalization:

> punycode.toUnicode(punycode.toASCII('man\u0303ana.com')) === 'mañana.com'
false
> punycode.toUnicode(punycode.toASCII('man\u0303ana.com'.normalize())) === 'mañana.com'
true

So our thought is to pass in the email 'testing@man\u0303ana.com' in tmp.js and assert

expect(punycode.toUnicode(internals.punyDomain)).to.equal('mañana.com');

Does that meet your expectations for normalize?

@skeggse
Copy link
Owner

skeggse commented Feb 8, 2017

Yeah that'll do for now.

  - adds UTF-16 surrogate pair characters to tests.json
  - adds exported `.normalize` method and associated test
  - adds test in `tmp.js` for punycoded normalized characters
@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 9, 2017

@skeggse - it looks like the behavior for String.prototype.normalize changes in the V8 versions used between Node v4.x and Node v6.x and it is breaking existing tests containing NULL characters \u0000.

https://github.com/v8/v8/blob/master/ChangeLog#L728

Node v6.x (all existing tests pass):

> '\"test\u0000\"@iana.org'.normalize()
'"test\u0000"@iana.org'

Node v4.x (normalize breaks existing tests with NULL):

> '\"test\u0000\"@iana.org'.normalize()
'"test'

[EDIT] Working on a nulNormalize helper to effectively backport the V8 fix to Node v4.x
Here's the V8 commit that corrected the normalize functionality

@skeggse
Copy link
Owner

skeggse commented Feb 10, 2017

Ok, awesome! (technically there's no monkey-patching in that last commit, but it is certainly a workaround)

I'll review the whole thing one more time this evening.

@WesTyler
Copy link
Contributor Author

Ha my bad, I guess you're right. I was thinking monkey-patch since it was checking process.version, but that doesn't mean it's modifying the existing .normalize functionality at runtime. :P

Thanks for the patience and help with this throughout the week

lib/index.js Outdated
@@ -1413,5 +1426,12 @@ exports.diagnoses = internals.validate.diagnoses = (function () {

exports.normalize = internals.normalize = function (email) {

// $lab:coverage:off$
if (process.version[1] === '4' && email.match(/\0/g)) {
Copy link

@lamchakchan lamchakchan Feb 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend replacing regex with a simple indexOf for performance reasons.

if (process.version[1] === '4' && email.indexOf('\u0000') === 0) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On it. Just haven't pushed up yet :)
Also, using email.indexOf('/u0000') >= 0 because the NUL is not necessarily at the beginning

Copy link
Owner

@skeggse skeggse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop that iterates over the characters in the string won't handle surrogate pairs. Is this appropriate? It seems like it would be better to consume an entire unicode code point for each iteration of that loop. Furthermore, on the next line, the token = email[i]; won't handle surrogate pairs either. Maybe token = email.codePointAt(i) and i += token.length or something. Please double-check the use of i as an iterator, including its uses inside the loop.

I added some length checks that fail, as they don't consider the octet count correctly.

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 11, 2017

Ok, I think I found a way to check for surrogate pairs. I've modified the iterator to grab the entire pair as the token and then "skip" the second octet. I'm pushing up that commit so you can check the modification, but the tests are still failing.

My question now is about the test cases and the asserted results. For the surrogate pair \ud83d\ude06, that should be considered a single character when calculating the email length, right? If so, then I believe the test cases you added may be asserting incorrectly. It should take 65 of the \ud83d\ude06 pairs to trigger the "rfc5322LocalTooLong" error, not 17... I have similar concerns about the other 4 tests cases if I'm right about this one.

Also, if that is the case, I need to update the way the parseData.local.length and parseData.domain.length lengths are being checked to account for '\ud83d\ude06'.length === 2 instead of the desired '\ud83d\ude06'.length === 1

@skeggse
Copy link
Owner

skeggse commented Feb 11, 2017

It's the exact opposite, actually. The original rules about limits on the number of octets for, say, the entire local part still apply. As a result, a should could as one octet, but \ud83d\ude06 should count as four octets.

I just realized the tests for the domain part might be wrong, though - the punycode version is what would actually be sent over the wire.

@WesTyler
Copy link
Contributor Author

WesTyler commented Feb 11, 2017

\ud83d\ude06 should count as four octets

Wait, what?? What in the world am I reading horribly wrong?? Haha

@skeggse
Copy link
Owner

skeggse commented Feb 11, 2017

Buffer.byteLength('\ud83d\ude06', 'utf8') === 4

@WesTyler
Copy link
Contributor Author

Ohhhhhhh. I see now. Got it, thank you.

I updated the iterator, but I still do need to account for the difference here:

> '\ud83d\ude06'.length
2
> Buffer.byteLength('\ud83d\ude06', 'utf8')
4

@skeggse
Copy link
Owner

skeggse commented Feb 11, 2017

Try

for (let i = 0; i < emailLength; i += token.length) {
  token = email.codePointAt(i);

@WesTyler
Copy link
Contributor Author

Is there something wrong or undesirable with the iterator update in this commit?

@WesTyler
Copy link
Contributor Author

If not, then I think my latest update (checking Buffer.byteLength instead of string.length) may be the final piece.
I had to update a couple of my international character tests because

> Buffer.byteLength('郵', 'utf8')
3

@skeggse
Copy link
Owner

skeggse commented Feb 11, 2017

The method you used to pull surrogate pairs out of the string functions. That said, it doesn't add anything - it makes the code harder to read, and I'd prefer readable and concise code.

It's also misleading: within the loop, we expect i to reference the start of the current character being processed. That said, the way we use it is inconsistent with surrogate pairs.

There are a number of lines that look like:

if (emailLength === ++i || email[i] !== '\n') {

This will break in my proposed iterator. I'd still prefer to move in that direction specifically because it makes i more consistent with my expectations. I can make these changes to your branch, if you want.

@WesTyler
Copy link
Contributor Author

Alright, that's fair.
token = email.codePointAt(i); doesn't work though. The rest of the logic requires that token is the actual substring of the email (like '\ud83d\ude06').

I'll see what I can do to clean it up though.

@skeggse
Copy link
Owner

skeggse commented Feb 11, 2017

Ah, crud. You're right. token = String.fromCodePoint(email.codePointAt(i));

@WesTyler
Copy link
Contributor Author

I can make these changes to your branch, if you want.

I would be on board with that if you have the time. Gotta try to get some time in with my wife and baby daughter this weekend :)

@WesTyler
Copy link
Contributor Author

@skeggse I had the chance to replace my internals.checkSurrogatePair with the

for (let i = 0; i < emailLength; i += token.length) {
  token = String.fromCodePoint(email.codePointAt(i));

I didn't end up needing to change any of the iterator-related code, only the logic in quote pairs that prepends '\\' to the token.

Is this closer to what you were hoping to see?

@skeggse
Copy link
Owner

skeggse commented Feb 15, 2017

Wonderful! I'll get these out on 2.3.0. Sorry for last weekend - I meant to take care of this but too many other things piled up.

@skeggse skeggse merged commit 0f795c5 into skeggse:master Feb 15, 2017
@WesTyler
Copy link
Contributor Author

Hey, no worries! I ended up having some bandwidth today and it wasn't as involved as I was worried it would be.

Thanks again for all of the patience and help with this :D

skeggse pushed a commit that referenced this pull request Jun 22, 2017
This was left over from some tests introduced in #151 (0f795c5) and removed in #153 (ac9acbd).
// Check if it's a neither a number nor a latin letter
else if (charCode < 48 || charCode > 122 || (charCode > 57 && charCode < 65) || (charCode > 90 && charCode < 97)) {
// Check if it's a neither a number nor a latin/unicode letter
else if (charCode < 48 || (charCode > 122 && charCode < 192) || (charCode > 57 && charCode < 65) || (charCode > 90 && charCode < 97)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WesTyler I'm reading through some of this code as I refactor, and now I find myself asking where 192 came from. You wouldn't happen to recall, would you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I believe 192 is the beginning of the Unicode "international" character set (À).

If I remember correctly, Unicode #s 123-191 are all non-character symbols.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, ok. Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants